Statistical significance and confidence intervals


Many papers in the Journal use
statistical methods and one of the
aims of the review process is to try
to ensure that appropriate methods have 
been used. Often papers report results of 
comparative studies that are designed to
answer questions such as whether one 
treatment is superior to another for a 
particular disease, or whether there is an 
association between some form of behaviour 
(for example, taking regular exercise or 
smoking) and the occurrence of some 
disease. Comparative studies are almost 
invariably carried out on a sample of 
individuals who are chosen from the 
population of individuals to whom it is 
intended to generalize the results. Data are 
collected on the sample in order to make 
inferences on the population. Valid 
inferences can only be drawn if the sample 
is chosen in such a way that it is representative
of the population. Otherwise a bias
could occur; epidemiological methods are 
designed to eliminate such biases. 

Since the aim of a statistical analysis is to
make inferences, it is paramount to express
whatever inferences can be drawn in the
most informative way. There are several
methods of statistical inference, but the two 
that are most commonly used are 
significance testing and confidence interval 
estimation. The former is well known and 
is characterized by the quoting of P values. Many
authors appear to be under the impression 
that a profusion of P values is necessary: 
regrettably this impression has been bolstered 
in the past by editors of biological journals.
Significance testing has its place but, as
mentioned by Healy in 1978, “it is widely
agreed among statisticians (if less so among
the more naive users of statistics) that
significance testing is not the be-all and end-all
of the subject”. In this leading article I
would like to discuss the characteristics of
both methods of inference, show that a 
confidence interval contains the result of a 
significance test, but not vice versa, and 
suggest that confidence intervals are the 
answers to the more interesting questions 
that data can be used to answer. 

Any particular study is based on a 
particular sample; however, it is useful to 
imagine that the study is repeated with a 
different sample being selected each time. 
These hypothetical studies will give different 
results because they contain different
individuals, and individuals vary in any
characteristic because of biological variability.
The differences are termed sampling
variability. It follows then that the results 
that are obtained from a particular sample 
can only be taken as an approximation to the 
actual situation in the whole population. 
Statistical methods are concerned with
assessing the degree of approximation and
what may be reasonably inferred, given that
a different sample would have produced a
different result.

The methods are based on the assumption
that it is a matter of chance which particular
subjects are in the sample that is being
studied, and the sampling variability is thus
random variation which is determined by the
laws of probability. Therefore, the inferences
are expressed in terms of probability. The
situation is illustrated below.

[Figure: Inferences on population]

Taking a sample from the population
involves sampling variation. As a consequence
of this, inferences from the sample
data back to the population involve
uncertainty.

A statistical analysis may be thought of as
asking questions of the data. In an investigation
that compares two groups for the
mean value of, for example, blood pressure
or the prevalence of some disease, three
questions may be posed: Is there a difference
between the groups? How large is the
difference? How accurately is the size
of the difference known?

As expressed, the first question expects the
answer “yes” or “no”; although the answer
cannot be given in precisely these terms, it
is often reduced to two possibilities. The
appropriate methodology is the significance
test. The second question expects a numerical
value to be the answer. This is an estimate
and, as it is a single value, is referred to as
a point estimate. In effect, the third question 
asks how reliable this point estimate is; the 
answer is a range of values which is referred 
to as an interval estimate or a confidence 
interval. 

These questions represent two approaches 
to inference: hypothesis testing and 
estimation. Although at first sight they 
appear to be quite different, in concept they
have much in common. Both make 
inferential statements about the value of a 
parameter. (A parameter is an unknown 
quantity which partly or wholly characterizes 
a population, for example, a mean or a 
measure of association.) 

The significance test is an appropriate
technique when there is an a priori hypothesis
to test. For the purpose of the statistical test
this hypothesis is expressed in null form
(for example, that no difference exists between
groups) and the test evaluates whether the
data are consistent with the null hypothesis.
If the data differ markedly from those which
would be expected under the null hypothesis,
to the extent that the probability of such an
extreme result is low, then it is said that the
result is statistically significant. Probability
is measured on a continuum between 0 and
1, but in significance testing a probability is
considered low if it is less than conventional
values such as 0.05 (5%) or 0.01 (1%). A
significant result is equated with the rejection
of the null hypothesis or the claim of a real
effect. By definition, when the null
hypothesis is true, significant results will
occur by chance with the same relative
frequency as the significance probability.
That is, real effects will sometimes be claimed
when the null hypothesis is true; however, the
probability of this error (type I) is determined in
the data analysis.
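
The claim that significant results occur under a true null hypothesis with the same relative frequency as the significance level can be checked by simulation. The following is a minimal sketch, not part of the original article: it assumes two groups drawn from the same normal distribution with a known standard deviation, and the function names, sample sizes and seed are all hypothetical choices made for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def z_test_p_value(sample_a, sample_b, sigma):
    """Two-sided P value for a difference in means, assuming a known common sigma."""
    se = sigma * sqrt(1 / len(sample_a) + 1 / len(sample_b))
    z = (fmean(sample_a) - fmean(sample_b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def false_positive_rate(n_studies=2000, n_per_group=30, alpha=0.05, seed=1):
    """Repeat a study many times when the null hypothesis is true (no real difference)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    rejections = 0
    for _ in range(n_studies):
        # Both groups come from the same population, so any "effect" is chance alone.
        group_a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        group_b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        if z_test_p_value(group_a, group_b, sigma=1.0) < alpha:
            rejections += 1
    return rejections / n_studies

rate = false_positive_rate()
print(f"Proportion of 'significant' results under a true null: {rate:.3f}")
```

The proportion printed should lie close to the chosen significance level of 0.05, differing from it only through the sampling variability of the simulation itself.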

One disadvantage of a significance test is
that it may fail to detect a real effect; that
is, although the null hypothesis is false, the
evidence is not strong enough to reject it. The
probability of this error (type II) can be
controlled at the design stage only, by
appropriate selection of the sample size, and
may be quite large. Thus, the trap of
equating non-significance with no effect 
must be avoided; failure to reject the null 
hypothesis is not the same as accepting it. 
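
To see how the type II error depends on sample size, the sketch below uses a common normal-approximation formula for the power of a two-sided comparison of two proportions. The formula, the function name and the success rates of 50% and 30% are assumptions chosen for illustration; they are not taken from the article, and the approximation ignores the negligible chance of rejecting in the wrong direction.

```python
from math import sqrt
from statistics import NormalDist

def approximate_power(p1, p2, n_per_group, alpha=0.05):
    """Approximate power of a two-sided test comparing two proportions."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    p_bar = (p1 + p2) / 2
    # Standard error of the difference under the null and under the alternative
    se_null = sqrt(2 * p_bar * (1 - p_bar) / n_per_group)
    se_alt = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_group)
    return NormalDist().cdf((abs(p1 - p2) - z_crit * se_null) / se_alt)

# Hypothetical true success rates of 50% and 30%
for n in (25, 50, 100, 200):
    power = approximate_power(0.5, 0.3, n)
    print(f"n per group = {n:3d}  power = {power:.2f}  type II error = {1 - power:.2f}")
```

Under these assumptions a study with 25 subjects per group would miss a real difference of 20 percentage points more often than it detects it, which is the sense in which the type II error "may be quite large".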

In the approach of confidence interval
estimation no particular hypothesis is considered;
rather, the emphasis is on estimating
those values of the parameter with which the
data are consistent. These values form a
range, the confidence interval. The range
is calculated so that there is a high probability
(conventionally 95% or 99%) that
it contains the true value of the parameter.
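
The probability attached to a confidence interval can be read in the repeated-sampling sense described earlier: if the study were repeated many times, about 95% of the 95% intervals so calculated would contain the true value. A minimal simulation sketch follows. It assumes a normally distributed measurement with a known standard deviation, and the true mean of 120, the standard deviation of 15, the sample size and the seed are hypothetical values, not figures from the article.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def coverage(true_mean=120.0, sigma=15.0, n=25, level=0.95, n_studies=2000, seed=2):
    """Proportion of repeated studies whose interval contains the true mean."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z_crit * sigma / sqrt(n)  # known-sigma interval for a mean
    hits = 0
    for _ in range(n_studies):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        estimate = fmean(sample)
        if estimate - half_width <= true_mean <= estimate + half_width:
            hits += 1
    return hits / n_studies

coverage_95 = coverage(level=0.95)
coverage_99 = coverage(level=0.99)
print(f"95% intervals containing the true mean: {coverage_95:.3f}")
print(f"99% intervals containing the true mean: {coverage_99:.3f}")
```

Each simulated study gives a different interval because each contains a different sample; the stated confidence level describes how often the procedure succeeds, not any single interval.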

A significance test is essentially a test of
whether the data are consistent with a
specified parameter value, and the confidence
interval contains those parameter
values with which the data are consistent.
Therefore, a 5% significance test and a 95%
confidence interval contain some information
in common: significance implies that
the null hypothesis value is outside the confidence
interval; non-significance implies that
the null hypothesis value is within the confidence
interval. However, the confidence
interval contains more information because
it is equivalent to performing a significance
test for all values of the parameter, not just
a single value. A confidence interval enables
a reader to see how large the effect may be,
not simply whether it is different from zero.
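
The correspondence between the two methods can be illustrated by testing a whole grid of candidate parameter values and collecting those that are not rejected at the 5% level. This is a sketch under simplifying assumptions: a normally distributed estimate with a fixed standard error, for which the correspondence is exact. The estimate of 10 and standard error of 4 are hypothetical, as are the function names.

```python
from statistics import NormalDist

def p_value(estimate, se, null_value):
    """Two-sided P value for the hypothesis that the parameter equals null_value."""
    z = (estimate - null_value) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def confidence_interval(estimate, se, level=0.95):
    """Normal-approximation confidence interval for the parameter."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate - z_crit * se, estimate + z_crit * se

estimate, se = 10.0, 4.0  # hypothetical difference and its standard error
lower, upper = confidence_interval(estimate, se)

# Test every candidate value from -10 to 30 in steps of 0.1
candidates = [i / 10 for i in range(-100, 301)]
not_rejected = [v for v in candidates if p_value(estimate, se, v) >= 0.05]

print(f"95% confidence interval: {lower:.2f} to {upper:.2f}")
print(f"Values not rejected at 5%: {min(not_rejected):.1f} to {max(not_rejected):.1f}")
print(f"P value for a null value of zero: {p_value(estimate, se, 0.0):.3f}")
```

The values that survive the test fill the confidence interval, up to the spacing of the grid, and zero is rejected precisely because it lies outside the interval.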

The limitations of the interpretations that 
are provided by a significance test may now 
be considered. 

The difference is significant. This means
that there is a difference or, in other words,
the size of the difference is not zero. We
know no more than this. The difference may
be large and of great importance or it may
be small and of no practical importance. It
is unsatisfactory that the test provides no way
of distinguishing between these quite
different possibilities.

The difference is not significant. This
means that there is insufficient evidence to
enable us to conclude that there is a
difference. So the difference may well be
zero. But this is not the same as saying that
it is zero. The true difference may be quite
large. Again, it is unsatisfactory that this
possibility is not addressed.

The conclusions that may be drawn from
a significance test are considered to be
incomplete because one is rarely
interested solely in whether a null hypothesis
is or is not true; indeed in many cases it may
be recognized at the outset that the null
hypothesis is unlikely to be true. Rather, the
question is how large is the difference and
is it possibly large enough to be important?
The emphasis is on measuring rather than on
testing. The addition of the concept of an
important difference to that of a null
hypothesis means that there are four possible
interpretations to an analysis: (a) the
difference is significant and large enough to
be of practical importance; (b) the difference
is significant but too small to be of practical
importance; (c) the difference is not
significant but may be large enough to be
important; and (d) the difference is not
significant and also not large enough to be
of practical importance.

The size of difference that is considered
to be large enough to be important is a
matter for debate, and genuine differences
of opinion may arise. It is a medical, not a
statistical, question, although a medical
statistician who is experienced in the subject
area could contribute to setting a value. The
fact that agreement on a unique value may
be impossible in no way detracts from the
argument. In fact, expressing the results as
a confidence interval enables interpretations
to be made for any particular value that is
considered appropriate.

These possibilities are illustrated in the
Figure where the confidence intervals are
shown. The significant and nonsignificant
cases are distinguished by the confidence
intervals that exclude or include zero respectively.
The main point is that in each case
the confidence interval gives the range of
possible values for the true difference. Of
particular concern is (c). Here there may be
no true difference or there may be a large,
important difference. In other words the
study is completely inconclusive. Such a
possibility is missed by the simple expression
“not significant” with its lure of equating
this falsely with “no effect”. This situation
will arise with a study that is carried out on
too small a sample and this is why good study
design demands attention to sample size to
try to prevent the occurrence of an inconclusive
result. Altman found that it was
common for undue emphasis to be placed on
“negative” findings from small studies,
while Freiman et al. noted that “negative”
trials were often too small to constitute a fair
test of therapies. Similarly, a significance
test will contrast (b) as significant and (d) as
not significant but fails to recognize that they
give essentially the same conclusion: that
any difference is too small to be important.
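
The four interpretations can be read directly from a confidence interval once a value for the smallest important difference has been agreed. The sketch below shows one possible reading, not a rule given in the article: a result is treated as significant when the interval excludes zero, and as possibly important when the interval reaches the important difference in either direction. The function name, the intervals and the threshold of 10 are hypothetical.

```python
def interpret(lower, upper, important_difference):
    """Classify a confidence interval for a difference as case (a), (b), (c) or (d)."""
    significant = not (lower <= 0.0 <= upper)  # interval excludes zero
    # Assumption: "may be important" means some value in the interval is at
    # least as large in magnitude as the agreed important difference.
    may_be_important = max(abs(lower), abs(upper)) >= important_difference
    if significant and may_be_important:
        return "a"  # significant and large enough to be important
    if significant:
        return "b"  # significant but too small to be important
    if may_be_important:
        return "c"  # not significant but may be important: inconclusive
    return "d"      # not significant and too small to be important

# Hypothetical intervals, with an important difference taken to be 10 units
examples = {"a": (8.0, 30.0), "b": (1.0, 4.0), "c": (-5.0, 25.0), "d": (-3.0, 4.0)}
for expected, (low, high) in examples.items():
    print(f"interval {low:6.1f} to {high:5.1f} -> case ({interpret(low, high, 10.0)})")
```

A stricter reading of (a) would require the whole interval to lie beyond the important difference. Either way, cases (b) and (d) lead to the same practical conclusion although a significance test separates them.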

As an example, consider some results
which were obtained by Garraway et al. from
a clinical trial for the management of acute
stroke in the elderly. Of 155 patients who
were managed in a stroke unit, 78 were
assessed as independent when they were
discharged from the unit compared with 49
of 152 who were managed in a medical unit.
The simplest analysis shows that the
difference between the success rates of the
two units is significant at the 1% level.
Therefore, a genuine effect has been established.
To appreciate the importance of this
effect the advantage of the stroke unit may
be measured by the difference between the
two units in the percentage of subjects
who were discharged as independent:
50.3% - 32.2% = 18.1%. This is the point
estimate. The accuracy of this estimate is
given by its standard error (5.5%) and the 95%
confidence limits (7.3% and 28.9%). Thus,
the gain could be as large as 29% or as small
as 7%.
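
The arithmetic of this example can be reproduced in a few lines. The sketch below assumes the usual large-sample normal approximation for a difference between two proportions, which the article does not spell out, and takes the stroke unit count to be 78 of 155, the figure implied by the quoted 50.3%. The function name and the use of a pooled standard error for the test are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist

def compare_proportions(x1, n1, x2, n2, level=0.95):
    """Point estimate, standard error, confidence limits and P value for p1 - p2."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    # Standard error used for estimation (each group keeps its own proportion)
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    # Standard error under the null hypothesis of a common proportion
    pooled = (x1 + x2) / (n1 + n2)
    se_null = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    p_value = 2 * (1 - NormalDist().cdf(abs(diff / se_null)))
    return {
        "difference": diff,
        "standard_error": se,
        "lower": diff - z_crit * se,
        "upper": diff + z_crit * se,
        "p_value": p_value,
    }

# Stroke unit: 78 of 155 independent; medical unit: 49 of 152 independent
result = compare_proportions(78, 155, 49, 152)
print(f"difference     = {100 * result['difference']:.1f}%")
print(f"standard error = {100 * result['standard_error']:.1f}%")
print(f"95% limits     = {100 * result['lower']:.1f}% to {100 * result['upper']:.1f}%")
print(f"P value        = {result['p_value']:.4f}")
```

The single statement "significant at the 1% level" says only that the difference is not zero; the interval shows that the advantage of the stroke unit is estimated to lie somewhere between about 7 and 29 percentage points.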

Recently, Gardner and Altman have
argued against the excessive use of hypothesis
testing and urged a greater use of confidence
intervals. In an appendix to their paper they
give methods to calculate confidence
intervals for the commonly occurring two-sample
comparisons.

In presenting the main results of a study 
it is good practice to provide confidence 
intervals rather than to restrict the analysis 
to significance tests. Only by so doing can 
authors give readers sufficient information
for a proper conclusion to be drawn;
otherwise readers have to rely upon the
authors' own interpretation. Therefore,
intending authors are urged to express their 
main conclusions in confidence interval form 
(possibly with the addition of a significance 
test, although strictly that would provide no 
extra information). One of the aims of the 
Journal's statistical review process will be to 
ensure that where possible this is done. 